MEDB 5501, Module07

2024-08-27

Topics to be covered

  • What you will learn
    • Categorical independent variables
    • R code for multiple linear regression
    • Multiple linear regression
    • R code for categorical independent variables
    • Diagnostic plots and multicollinearity
    • R code for diagnostic plots and multicollinearity
    • Your homework

Categorical independent variables

Break #1

  • What you have learned
    • Categorical independent variables
  • What’s coming next
    • R code for multiple linear regression

fev data dictionary, 1

data_dictionary: fev (.csv, sas7bdat, .sav, .txt)
copyright: |
  The author of the jse article holds the copyright, but does not list conditions under which it can be used. Individual use for educational purposes is probably permitted under the Fair Use provisions of U.S. Copyright laws.
description: |
  Forced Expiratory Volume (FEV) in children. The data was collected  in Boston in the 1970s.
additional_description: https://jse.amstat.org/v13n2/datasets.kahn.html

fev data dictionary, 2

download_url: https://www.amstat.org/publications/jse/datasets/fev.dat.txt
format:
  csv: comma delimited
  sas7bdat: proprietary (SAS)
  sav: proprietary (SPSS)
  txt: fixed width
varnames: not included
missing_value_code: not needed
size:
  rows: 654
  columns: 5

fev data dictionary, 3

age:
  scale: ratio
  range: positive integer
  unit: years
fev:
  label: Forced Expiratory Volume
  scale: ratio
  range: positive real
  unit: liters
ht:
  label: Height
  scale: positive real
  unit: inches

fev data dictionary, 4

sex:
  value:
    F: Female
    M: Male
smoke:
  value:
    'FALSE': Nonsmoker
    'TRUE': Smoker

simon-5501-07-fev.qmd, 1

---
title: "Template for 5501-07 programming assignment"
author: "Steve Simon"
format: 
  html:
    embed-resources: true
date: 2024-09-25
---

There is a [data dictionary][dd] that provides more details about the data. The program was written by Steve Simon on 2024-09-02 and is placed in the public domain.

[dd]: https://github.com/pmean/datasets/blob/master/fev.yaml

simon-5501-07-fev.qmd, 2

## Libraries

You should always load the tidyverse library. The broom library provides the glance, tidy, and augment functions that help you with computations of linear regression models. The car library provides the vif function for measuing collinearity.

```{r setup}
#| message: false
#| warning: false
library(broom)
library(car)
library(tidyverse)
```

simon-5501-07-fev.qmd, 3

## List variable names

Since the variable names are not listed in the data file itself, you need to list them here.

```{r names}
pulmonary_names <- c(
    "age",
    "fev",
    "ht",
    "sex",
    "smoke")
```

simon-5501-07-fev.qmd, 4

## Reading the data

Here is the code to read the data and show a glimpse. 

```{r read}
pulmonary <- read_csv(
  file="../data/fev.csv",
  col_names=pulmonary_names,
  col_types="nnncc")
glimpse(pulmonary)
```

simon-5501-07-fev.qmd, 5

## m1: Linear regression model using sex to predict fev

Is there a relationship between sex and fev? Do males tend to have larger fev values than females. This section (labelled m1) shows some simple descriptive and graphical summaries followed by a linear regression model.

simon-5501-07-fev.qmd, 6

## m1: Descriptive stastics for sex

```{r sex}
pulmonary |>
    count(sex) |>
    mutate(total=sum(n)) |>
    mutate(pct=100*n/total)
```

There are slightly more males (51%) than females in this sample.

simon-5501-07-fev.qmd, 7

## m1: Descriptive statistics for fev

You do not need to specify na.rm=TRUE for any of the descriptive statistics because there are no missing values.

```{r fev}
pulmonary |>
  summarize(
    fev_mn=mean(fev),
    fev_sd=sd(fev),
    fev_min=min(fev),
    fev_max=max(fev))
```

The average fev value, 2.6, seems reasonable. The standard deviation, 0.87, indicates a fair amount of variation. The minimum and maximum values both appear to be reasonable.

simon-5501-07-fev.qmd, 8

## m1: Tabular summary of the relationship between sex and fev, 1

```{r sex-and-fev-1}
pulmonary |>
  group_by(sex) |>
    summarize(
        fev_mn=mean(fev),
        fev_sd=sd(fev))
```

The average fev is 0.36 liters higher for males than females. There is very slightly more variation in the males (the standard deviation is 1.00 versus 0.65).

simon-5501-07-fev.qmd, 9

## m1: Graphical summary of the relationship between sex and fev, 2

```{r sex-and-fev-2}
pulmonary |>
  ggplot(aes(sex, fev)) +
    geom_boxplot() + 
    coord_flip() +
    ggtitle("Graph drawn by Steve Simon on 2024-09-26") +
      xlab("Sex") +
      ylab("Forced Expiratory Volume (liters)")
```

The boxplot shows the same pattern slightly larger fev values for males compared to females and slightly more variation. 

simon-5501-07-fev.qmd, 10

## m1: Fit the linear regression model

```{r m1-model}
m1 <- lm(fev ~ sex, data=pulmonary)
m1
```

The estimated average fev is 2.45 for females and 0.36 inches higher for males.

simon-5501-07-fev.qmd, 11

## m1: Analysis of variance table

```{r m1-anova}
anova(m1)
```

The F-statistic, 29.6, is large, and the p-value is very small (< 0.001). Reject the null hypothesis and conclude that the average fev values are different between males and females.

simon-5501-07-fev.qmd, 12

## m1: R-squared

```{r m1-r-squared}
glance(m1)$r.squared
```

Sex is a very weak predictor of fev. Only 4% of the variation in fev values can be accounted for by a patient's sex.

simon-5501-07-fev.qmd, 13

## m1: Confidence intervals

```{r m1-ci}
confint(m1)
```

We are 95% confident that the difference in fev values is between 0.23 and 0.49. This is a positive difference for all values in the confidence interval, demonstrating that the average fev values are larger for males than for females. This interval is narrow indicating that there is very little sampling error. Hardly a surprise with such a large dataset (n=654).

simon-5501-07-fev.qmd, 14

## m1: T-tests

```{r m1-t-tests}
tidy(m1)
```

The t-statistic, 5.4, is not close to zero. Conclude that there is a difference in average fev values between males and females. Since the t-statistic is positive, conclude also that the average fev value is larger for males.

simon-5501-07-fev.qmd, 15

## m1: Normal probability plot of residuals

```{r m1-qq-plot}
r1 <- augment(m1)
qqnorm(r1$.resid)
```

The straight lines indicates that the residuals are normally distributed.

simon-5501-07-fev.qmd, 16

## m1: Histogram of residuals

```{r m1-histogram}
r1 |>
    ggplot(aes(.resid)) +
      geom_histogram(
        binwidth=0.2,
        color="black",
        fill="white") +
    ggtitle("Graph drawn by Steve Simon on 2024-09-26") +
      xlab("Residuals from m1 regression model")
```

The histogram also indicates that the residuals are normally distributed.

simon-5501-07-fev.qmd, 17

## m1: Influential data points

Both leverage and Cook's distance make little sense for a regression model using a categorical independent variable. The studentized deleted residual is still useful.

```{r m1-studentized-deleted-residual}
r1 |>
    filter(abs(.std.resid) > 3)
```

There are three outliers on the high end, corresponding to fev values of 5.6, 5.6, and 5.8 liters in males. There are no outliers among the female patients.

simon-5501-07-fev.qmd, 18

## m2: Linear regression model using smoke to predict fev

Is there a relationship between smoke and fev? This section (labeled m2) shows some simple descriptive and graphical summaries followed by a linear regression model.

simon-5501-07-fev.qmd, 19

## m2: Descriptive statistics for smoke

```{r smoke}
pulmonary |>
    count(smoke) |>
    mutate(total=sum(n)) |>
    mutate(pct=100*n/total)
```

There are very few smokers (65 or 10%) in this sample. Descriptive statistics for fev were shown earlier.

simon-5501-07-fev.qmd, 20

## m2: Relationship between smoke and fev, 1

```{r smoke-and-fev-1}
pulmonary |>
  group_by(smoke) |>
    summarize(
        fev_mn=mean(fev),
        fev_sd=sd(fev))
```

Smokers have an average fev value that is 0.7 units higher than non-smokers. The standard deviations (0.85 and 0.75) demonstrate roughly the same amount of variation in the two groups.

Break #2

  • What you have learned
    • R code for multiple linear regression
  • What’s coming next
    • Multiple linear regression

Model

  • \(Y_i=\beta_0+\beta_1 X_{1i}+\beta_2 X_{2i}+\epsilon_i\)
  • Least squares estimates: \(b_0,\ b_1,\ b_2\)

Interpretations

  • \(b_0\) is the estimated average value of Y when X1 and X2 both equal zero.
  • \(b_1\) is the estimated average change in Y
    • when \(X_1\) increases by one unit, and
    • \(X_2\) is held constant
  • \(b_2\) is the estimated average change in Y
    • when \(X_2\) increases by one unit, and
    • \(X_1\) is held constant

Unadjusted relationship between height and FEV

Relationship between height and FEV controlling at Age=3

Relationship between height and FEV controlling at Age=3

Relationship between height and FEV controlling at Age=4

Relationship between height and FEV controlling at Age=5

Relationship between height and FEV controlling at Age=6

Relationship between height and FEV controlling at Age=7

Relationship between height and FEV controlling at Age=8

Relationship between height and FEV controlling at Age=9

Relationship between height and FEV controlling at Age=10

Relationship between height and FEV controlling at Age=11

Relationship between height and FEV controlling at Age=12

Relationship between height and FEV controlling at Age=13

Relationship between height and FEV controlling at Age=14

Relationship between height and FEV controlling at Age=15

Relationship between height and FEV controlling at Age=16

Relationship between height and FEV controlling at Age=17

Relationship between height and FEV controlling at Age=18

Relationship between height and FEV controlling at Age=19

Unadjusted relationship between age and fev

Relationship between age and FEV controlling for height between 46 and 49.5

Relationship between age and FEV controlling for height between 50 and 53.5

Relationship between age and FEV controlling for height between 54 and 57.5

Relationship between age and FEV controlling for height between 58 and 61.5

Relationship between age and FEV controlling for height between 62 and 65.5

Relationship between age and FEV controlling for height between 66 and 69.5

Relationship between age and FEV controlling for height between 70 and 73.5

Break #3

  • What you have learned
    • Multiple linear regression
  • What’s coming next
    • R code for categorical independent variables

simon-5501-07-fev.qmd, 21

## m2: Relationship between smoke and fev, 2

```{r smoke-and-fev-2}
pulmonary |>
  ggplot(aes(smoke, fev)) +
    geom_boxplot() + 
    coord_flip() +
    ggtitle("Graph drawn by Steve Simon on 2024-09-26") +
      xlab("Did the patient smoke?") +
      ylab("Forced Expiratory Volume (liters)")
```

The boxplot shows the same pattern as noted above.

simon-5501-07-fev.qmd, 22

## m2: Fit the linear regression model

```{r m2-model}
m2 <- lm(fev ~ smoke, data=pulmonary)
m2
```

The estimated average fev is 2.57 liters for non-smokers and 0.71 liters higher on average for smokers.

Normally, you would follow this up with various functions (anova, confint, tidy), assess various diagnostic plots using the residuals, and identify influential data points.

simon-5501-07-fev.qmd, 23

## m3: Linear regression model using age and ht to predict fev

Previous analysis has shown that age by itself is a strong predictor of fev and height by itself is a strong predictor of fev. A multiple linear regression model including both age and height should do an even better job in predicting fev. This model will also allow you to compare the relative importance of the two variables. Is fev most strongly associated with how big a child is or how old that child is?

You should always start with descriptive statistics and graphs. With two or more independent variables, you should also examine the correlations between the independent variables and their correlations with the dependent variable.

simon-5501-07-fev.qmd, 24

## m3: Descriptive statistics for age

```{r age}
pulmonary |>
  summarize(
    age_mn=mean(age),
    age_sd=sd(age),
    age_min=min(age),
    age_max=max(age))
```

The descriptive statistics are consistent with a pediatric study. The average age is 9.9 years. The standard deviation, 3.0, shows a large amount of variation in age (large at least for a pediatric study). The range, 3 years to 19 years, also demonstrates a large amount of variation.

simon-5501-07-fev.qmd, 25

## m3: Descriptive statistics for ht

```{r ht}
pulmonary |>
  summarize(
    ht_mn=mean(ht),
    ht_sd=sd(ht),
    ht_min=min(ht),
    ht_max=max(ht))
```

The average height, 61 inches, is reasonable for a group of children with an average age of around 10 years. The standard deviation, 5.7 inches, and the range, 46 inches to 74 inches, show a moderate amount of variation.

simon-5501-07-fev.qmd, 26

## m3: Correlations

Using [ , 1:3] reduces the pulmonary data frame to just the first three columns.

```{r corr}
cor(pulmonary[ , 1:3])
```

The two independent variables, age and ht, are both strongly correlated with fev (r=0.76 and 0.87). They are also strongly correlated with one another (r=0.79).

simon-5501-07-fev.qmd, 27

## m3: Predicting fev using age and ht

```{r m3}
m3 <- lm(fev ~ age + ht, data=pulmonary)
m3
```

The estimated average fev value increases by 0.05 liters for each increase of one year in age, holding height constant. The estimated average fev value increases by 0.11 liters for every increase of one inch in height. The estimated average fev is -4.6 for a patient of age zero with a height of zero inches. This is clearly and extrapolation beyond the range of the data.

simon-5501-07-fev.qmd, 28

## m3: Confidence intervals, 1

The use of [2, ] (and of [3, ]) isolates the individual rows of the data frame produced by confint. This is not really necessary, but is done to fit the information easily on separate slides.

```{r m3-ci-1}
confint(m3)[2, ]
```

We are 95% confident that the estimated average fev value increases between 0.036 and 0.072 liters for each increase of one year of age holding height constant. Conclude that there is a positive relationship between fev and age.

simon-5501-07-fev.qmd, 29

## m3: Confidence intervals, 2

```{r m3-ci-2}
confint(m3)[3, ]
```

We are 95% confident that the estimated average fev value increases between 0.10 and 0.12 liters for each increase of one inch in patient's height holding age constant. Conclude that there is a positive relationship between height and fev.

Do not interpret the confidence interval for the intercept.

simon-5501-07-fev.qmd, 30

## m3: Analysis of variance table

```{r m3-anova}
anova(m3)
```

The sum of squares total (SST) is 280.9 + 95.3 + 114.7 = 490.9. Only a small portion of the variation (114.7) is unexplained variation. The F-ratio is much larger than 1, and the p-value is less than 0.001. You can conclude that the combination of age and height helps significantly in predicting fev.

Break #4

  • What you have learned
    • R code for categorical independent variables
  • What’s coming next
    • Diagnostic plots and multicollinearity

Diagnostic plots and multicollinearity

Break #5

  • What you have learned
    • Diagnostic plots and multicollinearity
  • What’s coming next
    • R code for diagnostic plots and multicollinearity

simon-5501-07-fev.qmd, 31

## m3: R-squared

```{r m3-r-squared}
glance(m3)$r.squared
```

Roughly 77% of the variation in fev can be explained by the combination of the age and height of the patients.

simon-5501-07-fev.qmd, 32

## m3: t-tests

```{r}
tidy(m3)
```

simon-5501-07-fev.qmd, 33

## m3: Diagnostic plots, 1

```{r diagnostic-1}
r3 <- augment(m3)
qqnorm(r3$.resid)
```

simon-5501-07-fev.qmd, 34

## m3: Diagnostic plots, 2

```{r diagnostic-2}
r3 |>
  ggplot(aes(.resid)) +
    geom_histogram(
      binwidth=0.2,
      color="black",
      fill="white") +
    xlab("Residuals from m3 regression") +
    ggtitle("Graph drawn by Steve Simon on 2024-09-25")
```

simon-5501-07-fev.qmd, 35

## m3: Diagnostic plots, 3

```{r diagnostic-3}
r3 |>
  ggplot(aes(.fitted, .resid)) +
    geom_point() +
    xlab("Predicted values from m3 regression") +
    ylab("Residuals from m3 regression") +
    ggtitle("Graph drawn by Steve Simon on 2024-09-25")
```

simon-5501-07-fev.qmd, 36

## m3: Influential data points, 1

```{r influence-1}
n <- nrow(r3)
r3 |> filter(.hat > 3*3/n)
```

simon-5501-07-fev.qmd, 37

## m3: Influential data points, 2

```{r influence-2}
r3 |> 
  filter(abs(.std.resid) > 3)
```

simon-5501-07-fev.qmd, 38

## m3: Influential data points, 3

```{r influence-3}
r3 |> 
  filter(.cooksd > 1)
```

simon-5501-07-fev.qmd, 39

## m3: Variance inflation factor

```{r vif}
library(car)
vif(m3)
```

Break #6

  • What you have learned
    • R code for diagnostic plots and multicollinearity
  • What’s coming next
    • Your homework

simon-5501-07-directions.qmd, 1

---
title: "Directions for 5501-01 programming assignment"
author: "Steve Simon"
format: 
  html:
    embed-resources: true
date: 2024-08-18
---

This code is placed in the public domain.

simon-5501-07-directions.qmd, 2

## Setup

-   Download the [template][tem]
    -   Store it in your src folder
-   Modify the file name
    -   Use your last name instead of "simon"
-   Modify the documentation header
    -   Add your name to the author field
    -   Optional: change the copyright statement
-   Download the [data file][dat]
    -   Store it in your data folder

[tem]: https://github.com/pmean/classes/blob/master/biostats-1/01/src/simon-5501-01-template.qmd
[dat]: https://github.com/pmean/datasets/blob/master/albuquerque-housing.csv
    

simon-5501-07-directions.qmd, 3

## Interpret the output

I provided an interpretation for mean
price. Provide a similar interpretation 
for the other three means. Be sure to

-   use a descriptive name
-   include the units of measurement, if appropriate
-   round to two or three significant digits
-   use comma separators for numbers 1,000 or larger

simon-5501-07-directions.qmd, 4

## Your submission

-   Save the output in html format
-   Convert it to pdf format.
-   Make sure that the pdf file includes
    -   Your last name
    -   The number of this course
    -   The number of this module
-   Upload the file

Summary

  • What you have learned
    • Categorical independent variables
    • R code for multiple linear regression
    • Multiple linear regression
    • R code for categorical independent variables
    • Diagnostic plots and multicollinearity
    • R code for diagnostic plots and multicollinearity
    • Your homework